SUMMARY


A reduction in the minimum attainable feature size in
integrated circuits has led to the possibility of more and
more complex circuits being built on a single chip (VLSI).
This technological advance brings with it the need to make
these circuits fault tolerant: to increase yield and
reliability and to reduce testing times. This Memorandum
briefly reviews current techniques for designing fault
tolerant circuits before concentrating on a new, high-level
fault tolerance technique: algorithmic fault tolerance.

The  concept  of  algorithmic  fault  tolerance  is  explained 
and  various  techniques  are  reviewed  with  regard  to  their 
suitability  for  providing  fault  tolerance  for  signal 

1 Introduction


The use of high-density integrated circuit manufacturing methods (VLSI and
above), whilst bringing many advantages to integrated circuit design, also creates
several new problems. The increase in complexity of the devices being produced,
coupled with limited access to internal circuit nodes, means that testing the
device is increasingly difficult, both in the factory and in the field.

The reduction in geometry size that allows this increased packing density
means that defects in the silicon, or any of the numerous layers involved in
modern IC construction, are more likely to result in the malfunction of a
transistor [1]. The gate that contains this transistor, and ultimately the
device itself, will thus exhibit a fault. The increased likelihood of a
defective transistor is therefore reflected in a poorer process yield.

A third problem is that of soft failures. These are faults that exist
intermittently, affecting the device for a finite period before disappearing,
possibly to return at some later time. Possible causes of such a phenomenon
include the alteration of the charge on a transistor's gate by cosmic radiation
or α-particle emission, the latter usually being due to the case material (this
is particularly common in solid state memory devices); electron tunnelling
through thin barriers; electromagnetic pick-up within the device; and capacitive
coupling between adjacent circuit elements. The occurrence of such a fault
renders the device strictly useless since no confidence can be placed in its
operation.

These soft hardware faults can be augmented by a similar set of software
faults. The complexity of VLSI devices is such that the components of an
equipment can now include microprocessors. These will be under software control
and thus may make mistakes due to shortcomings in the program design. It is
probably fair to assume that the program has, to some extent, been tested and so
works under most conditions. It is possible, however, that certain (rare?)
conditions exist under which the program will produce an erroneous result,
returning to normal operation once these conditions have been removed.

To the class of soft hardware failures, the more usual one of hard or
permanent faults may be added. These include the well known problems of circuit
failure due to ageing and shock and vibration. The distinction between soft and
hard failures is that in the latter case the fault, once manifested, remains in
evidence. Again the high densities involved in VLSI mean that such faults are
more likely to occur, thus shortening the life of the device.


The net effect of all of the above points is that to gain the full benefit of
these high-density fabrication methods, or to make software controlled equipment
more reliable, it is becoming increasingly important to be able to design "fault
tolerant" systems. One possible solution to these problems is some form of error
detection coupled with an in-built redundancy. The error detection capability
effectively provides a self-test capacity for the device, even after it has left
the factory, whilst the redundancy in the system can be used to mask out any fault,
thus enabling the device to continue to function correctly. A "static" fault
tolerant device that can mask out faults under non-operational conditions will
clearly improve yield and device lifetime (if the self-repair process is used
intermittently). To protect against the soft failure, however, requires
concurrent self-repair i.e. the ability to detect and mask faults in parallel
with normal operations.

2  Fault  Tolerant  Techniques 

There are many different approaches [2][3] to the task of making a system
fault tolerant. A significant factor is the type of system being considered and
the literature can accordingly be subdivided on the basis of the category into
which the system being considered falls. A list of categories and some examples
of both the category and the fault tolerant technique can be found in table 1.
It should be realised that in some cases the distinction is artificial, and that
the edges between categories are often blurred.

The  majority  of  the  work  in  the  field  of  fault  tolerant  design  has  been 
concerned  with  the  design  of  special  purpose  hardware  or  switching  algorithms. 
The  former  category  includes  such  circuits  as  AN-coded  adders  (see  below).  The 
idea  here  is  to  create  fault  tolerant  systems  by  constructing  them  from  fault 
tolerant  components.  In  the  case  of  the  switching  algorithms  the  interest  is 
usually  in  adaptive  configuration/routing  in  computer-based  communication 
networks  (meaning  collections  of  complex,  usually  software  controlled, 
processing elements). The object of the network may be anything from
inter-mainframe-computer communication to dedicated hardware for systolic
algorithms. 
Category              Typical Members              Technique

Low Level Circuits    General circuits             Triple Modular Redundancy
                      Modulo Reduction             Self-checking Checkers
                      Adders/Multipliers           Arithmetic Codes
                      Memories/data buses          BCH Codes

High Level Circuits   FIR Filters                  BCH Codes
                      Matrix Maths. Accelerators   Matrix Checksums
                                                   Invariance Properties

Equipment             Systolic Array               Hardware Controlled Reconfiguration
                      Multiprocessors              Software Controlled Reconfiguration

Software              Computer Network             Switched Packet Network
                      Subroutine                   Re-try Schemes

Table 1. Fault Tolerant Techniques.


Very little interest has been shown in the problem of algorithmic fault
tolerance i.e. the design of hardware-independent fault tolerant mathematical
algorithms. Such algorithms are concerned with the problem of achieving
error-free results in mathematical calculations without having to specify the exact
nature of the hardware. If a particular mathematical calculation were to be
implemented in such an algorithm, either in software (e.g. general purpose
computer) or hardware (e.g. systolic array), any fault with the underlying
system (ALU or processing elements) will be detected and, ideally, corrected
(see figure 1). Of the techniques mentioned, those based on arithmetic codes,
BCH codes, matrix checksums and matrix invariance properties are most suitable
for such an approach. These and several other techniques, which also appear to
hold some promise, are discussed further in section 3.
Figure 1. Algorithmic Fault Tolerance.

3 Algorithmic Fault Tolerance
3.1  Literature  Survey 

Abraham [4][5] and Luk [6][7] proposed the addition of row and column
checksums to matrices in order to provide error protection in various matrix
operations (multiplication, addition, LU and QR decomposition). The checksums
are nothing more than the sum of the elements in a row or column but, as Abraham
noted, these sums are invariant under the matrix operations mentioned. In a more
recent paper [8] Abraham also briefly considered how to exploit, in the detection
of errors, any properties of the matrix that remain invariant under the given
operation e.g. checking for a change in the norm of a row or column vector under
a unitary transformation.
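The checksum idea can be illustrated with a small Python sketch. It is not taken from [4]-[7]; the helper names are hypothetical and integer matrices are assumed so that the comparisons are exact. A row of column sums is appended to A and a column of row sums to B, so that the product carries both sets of checksums, and a single corrupted element shows up as one failing row check and one failing column check.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def add_column_checksum(a):
    """Append a row holding the sum of each column."""
    return a + [[sum(col) for col in zip(*a)]]

def add_row_checksum(b):
    """Append to each row the sum of its elements."""
    return [row + [sum(row)] for row in b]

def find_errors(c):
    """Return (bad_rows, bad_cols) for a product carrying both checksums."""
    n_rows, n_cols = len(c) - 1, len(c[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if sum(c[i][:n_cols]) != c[i][n_cols]]
    bad_cols = [j for j in range(n_cols)
                if sum(c[i][j] for i in range(n_rows)) != c[n_rows][j]]
    return bad_rows, bad_cols

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
# The last row and last column of C are the checksums of the product A*B.
C = matmul(add_column_checksum(A), add_row_checksum(B))
```

Once the failing row and column are known, the faulty element can be recomputed from the row checksum minus the other elements of that row.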

It has long been known that the convolution operation is equivalent to the
multiplication of polynomials and this is exactly the framework used in the
theory of BCH error-correction codes. Redinbo [9] used these facts to
incorporate error-correction in an FIR filter. As a result of this study he also
noted that the so-called chord property (see note 1) of the number theoretic
transform effectively implies redundancy and so can also be used for error protection.

The AN and residue codes [10][11][12][14] have, up to now, been considered
only in conjunction with special purpose hardware to construct fault tolerant
arithmetic circuits. There is no reason, however, why these coding methods
cannot be used at the algorithmic level in much the same way as the BCH codes
were by Redinbo. In fact AN codes are entirely analogous to BCH codes, both
codes being cyclic codes (i.e. ideals of a finite ring (see note 2): $Z_{2^n-1}$ and
$GF(q)[x]/\langle x^n - 1 \rangle$ respectively).

There are also several other areas of mathematics that contain results that
may be of some use in the field of algorithmic fault tolerance. These and the
above mentioned techniques are reviewed in the following.


1. necessary existence of conjugates.

2. $Z_n$ is the ring of integers modulo n.
$GF(q)[x]/\langle x^n - 1 \rangle$ is the ring of polynomials with coefficients taken from the
Galois field with q elements, modulo the polynomial $x^n - 1$.
3.2 AN and BCH Codes


The basis of the use of cyclic codes for error protection is as follows. Let
R be a "commutative ring with unity" (e.g. the ring of integers modulo a given
integer) and consider the addition of $a' = g \cdot a \in R$ and $x' = g \cdot x \in R$, for
arbitrary a and x and some given g (see figure 2).


[Figure 2 diagram: the region between the encoders and the decoder is marked "error protected".]

Figure 2. Cyclic Code for Addition

If an error should occur in the addition operation then the result y is given
by

$$y = a' + x' + e = g \cdot a + g \cdot x + e = g \cdot (a + x) + e$$

Thus, provided g was chosen so that $e \not\equiv 0 \pmod{g}$ for all $e \neq 0$, the error can be
detected by the occurrence of a non-zero residue of y modulo g (i.e. by the fact
that y is not a multiple of g). Indeed g may in fact be selected so that an
error value e can be uniquely (see note 3) recovered from the erroneous result and
hence the correct result found. Consider the integers taken modulo 513:

$$Z_{513} = \{0, 1, 2, \ldots, 510, 511, 512\}.$$

A single error correcting code results from the choice g = 19:

$$19 \cdot 5 + 19 \cdot 3 = 95 + 57 = 152 = 19 \cdot 8.$$

3. the error value is unique provided that the least complex error is chosen.
However if an error occurs (a faulty carry circuit for bit 3, say):

      19*5  ->   1011111
    + 19*3  ->   0111001
               ---------
                10010000  ->  144.

Now 144/19 = 7 rem. 11, and the non-zero remainder indicates an error. In fact
it can be shown that the only way to get a remainder of 11, with a single error,
is if (-8) was added into the sum, thus the correct answer is

$$(144 - (-8))/19 = 8.$$
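The worked example above can be reproduced with a short Python sketch. The assumption made here, which the text only implies, is that a "single error" means an additive error of plus or minus a power of two in one of the nine low-order bit positions; the function names are illustrative only.

```python
G = 19     # generator from the worked example
BITS = 9   # assumption: single errors are +/- 2**i for i = 0..8

# Syndrome table: residue modulo G -> the single error that produces it.
SYNDROMES = {}
for i in range(BITS):
    for e in (2 ** i, -(2 ** i)):
        r = e % G
        # Distinct residues are what make single errors correctable.
        assert r not in SYNDROMES
        SYNDROMES[r] = e

def encode(a):
    """AN-encode the integer a."""
    return G * a

def decode(y):
    """Return (value, error) for a possibly corrupted encoded sum y."""
    r = y % G
    if r == 0:
        return y // G, 0
    e = SYNDROMES[r]
    return (y - e) // G, e
```

With these definitions `decode(encode(5) + encode(3))` gives `(8, 0)`, and the corrupted sum 144 decodes to `(8, -8)`, matching the remainder of 11 found above.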

Note that by encoding the two inputs (a and x) in the above fashion both arms
of the (dyadic) operation are protected (as indicated by the bars in figure 2).

The theories of AN and BCH codes are extensive ([10][11][12], [15][16][17]).
The setting for AN codes is the ring of integers modulo $2^n - 1$, for some n, (cf.
one's complement arithmetic) and so AN codes are clearly a good choice for
attempting to develop a theory of fault tolerant (computer based) algebra. A
fault tolerant algebra consists of redundant forms of multiplication and
addition that can detect or even correct errors. In order to begin to look at
this possibility, however, it is clearly necessary for the AN coding technique
to be applicable to multiplication as well as addition. This is by no
means straightforward: the AN code was originally designed to combat errors in
addition only.

BCH codes on the other hand could be used in the field of signal processing
by virtue of the equivalence between the convolution of finite sequences and the
multiplication of polynomials (see note 4) [18]. The regime for BCH codes is also
that of polynomial algebra; however, they too were originally designed for fault
tolerant addition, whereas the convolution operation is equivalent to multiplication.

One possible way to achieve error protection in the multiplication process is
shown in figure 3. This is essentially the technique used by Redinbo [9] and
Chien [13]. Notice that here only one of the two inputs is encoded (multiplied
by g) and thus only this input is error-protected. The reason behind this is the
fact that the product of two multiples of g is a multiple of $g^2$:

$$(g \cdot a) \cdot (g \cdot x) = g^2 \cdot (a \cdot x)$$

Thus the product would, in general, be the encoded form not of $a \cdot x$ but of $g \cdot a \cdot x$
(see example at the end of section A.2.2). There are other shortcomings with
the use of cyclic codes which are discussed further in Appendix A, along with
some ideas on how to overcome them.

4. The coefficients of the polynomials are the elements of the sequences.


Figure 3. Cyclic Code for Multiplication


3.3  Number  Theoretic  Transforms 


Redinbo's solution [9] to the above problem of protecting both operands of a
multiplication (specifically in $GF(q)[x]/\langle x^n - 1 \rangle$ i.e. for polynomials) was to
use the redundancy of the number theoretic transform (NTT) when the input is
restricted to a proper subset of the domain. Redinbo considered the problem of
how to error protect an FIR filter, i.e. the convolution of finite sequences.
One way to perform this operation is by pointwise multiplication in a transform
domain (figure 4; cf. convolution via the DFT).

It is well known that the input space (the domain) of the DFT is the space of
(finite) sequences of complex numbers. If the input is restricted so as to be a
real sequence, the output sequence obeys certain relationships as a consequence:
$V^*(k) = V(N-k)$, $0 < k < N$. The domain of an NTT is the space of finite sequences
with coefficients from a certain (see note 5) extension field of GF(q) ($GF(q^m)$, say;
see note 6). If the input is then restricted to sequences over GF(q), the resulting
"spectrum" again obeys certain relationships. These relationships are sometimes
collectively referred to as "the chord property" of the sequence. The use of the
chord property allows the input variable to be checked for errors, the coefficient
being precomputed (as signified by the brackets in figure 4).

5. chosen so that it contains a primitive Nth root of unity

6. it may turn out that m = 1.
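The DFT relationship quoted above already gives a simple error check, sketched here in Python. The function names and the tolerance are assumptions made for illustration; a real implementation would need a tolerance matched to its arithmetic.

```python
import cmath

def dft(v):
    """Direct O(N^2) discrete Fourier transform of a sequence v."""
    n = len(v)
    return [sum(v[i] * cmath.exp(-2j * cmath.pi * i * k / n)
                for i in range(n))
            for k in range(n)]

def looks_like_real_input(V, tol=1e-9):
    """True if V(N-k) equals the complex conjugate of V(k) for 0 < k < N."""
    n = len(V)
    return all(abs(V[n - k] - V[k].conjugate()) < tol for k in range(1, n))

V = dft([1.0, 2.0, 0.0, -1.0])   # spectrum of a real sequence
```

Here `looks_like_real_input(V)` holds, whereas corrupting a single coefficient, for example `V[1] += 0.5j`, breaks the symmetry and is therefore detected.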


[Figure 4 diagram: x in GF(q)^n -> NTT -> X in GF(q^m)^n -> pointwise multiplication by A -> inverse NTT -> a*x, where A in GF(q^m)^n is the precomputed NTT of a in GF(q)^n (shown in brackets).]

Figure 4. Convolution via NTT


Specifically, the chord property of the NTT is

$$(V_j)^q = V_{((qj))}, \quad j = 0, \ldots, n-1$$

where the index $((qj))$ is taken modulo n and $V \in GF(q^m)^n$ is the NTT of
$v \in GF(q)^n$. In effect this says that the
transform coefficients that are indexed by the elements of a certain subset (a
cyclotomic subset) can be determined from just one coefficient. Thus any error,
not itself an allowable input, will cause the chord property to be violated and
hence the error can be detected. The correct value can be found from the other
coefficients and the above formula. If there are $L_j$ members of the cyclotomic
subset C(j) then the minimum distance of the "code" is $L_j$ and hence it can
detect $L_j - 1$ errors.
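A minimal Python sketch of the chord property follows, under assumptions chosen purely for illustration: q = 2, m = 3 and n = 7, with GF(8) built from the primitive polynomial x^3 + x + 1 and field elements stored as 3-bit integers. None of these choices comes from [9].

```python
# Powers of a primitive element alpha of GF(8), modulus x^3 + x + 1.
EXP = [1, 2, 4, 3, 6, 7, 5]
LOG = {value: power for power, value in enumerate(EXP)}
N = 7

def gf_mul(a, b):
    """Multiply two GF(8) elements using log/antilog tables."""
    if a == 0 or b == 0:
        return 0
    return EXP[(LOG[a] + LOG[b]) % N]

def ntt(v):
    """V_j = sum_i alpha^(i*j) * v_i; addition in GF(8) is XOR."""
    out = []
    for j in range(N):
        acc = 0
        for i, vi in enumerate(v):
            acc ^= gf_mul(EXP[(i * j) % N], vi)
        out.append(acc)
    return out

def chord_ok(V):
    """Check (V_j)^2 == V_(2j mod 7) for every j (the case q = 2)."""
    return all(gf_mul(V[j], V[j]) == V[(2 * j) % N] for j in range(N))

V = ntt([1, 0, 1, 1, 0, 0, 0])   # input restricted to GF(2)
```

The check passes for every binary input sequence, and fails as soon as one transform coefficient is altered, for example by `V[1] ^= 1`.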

As  the  NTT  is  linear,  it  commutes  with  the  addition  operator  and  so  it  is 
possible  to  protect  the  addition  of  sequences  as  well  as  their  convolution.  This 
then  constitutes  a  fault  tolerant  algebra  of  finite  sequences.  The  exact  error- 
protection  power  of  such  an  algebra  is  as  yet  unknown.  It  is  also  worth  noting 
that  the  algebra  of  polynomials  is  also  the  setting  for  most  Galois  fields.  It 
therefore  seems  likely  that  this  technique  can  be  used  to  construct  a  fault 
tolerant  Galois  field  algebra. 

3.4  Residue  Codes 


The main drawback with AN codes is the fact that the minimum distance of the
code tends to vary inversely with the size of the code (number of codewords). Thus
whilst it is possible to have single error-correcting codes with a reasonable
number of codewords, the large distance codes tend to have very small sizes e.g.
the generator A = 13797 produces a distance 6 code (double error correcting and
triple error detecting) but the arithmetic has to be done modulo $2^{18} - 1$ so there
are only $(2^{18} - 1)/13797 = 19$ codewords.

A more attractive option is residue codes. These can be constructed,
amongst other ways, from AN codes. The minimum distance, and thus the
error-protection capability, is preserved and the size of the code is that of the ring
e.g. the residue code based on the AN code with A = 13797 (see above) has
distance 6 and $2^{18} - 1$ codewords. Residue codes also work just as well for
multiplication as for addition. The main drawback, if it is one, is that the
coded form of a number is in the form of a vector and the arithmetic algorithms
for the individual components are different from one another. This clearly adds
to the complexity of the system.

A residue code works by keeping a check on the number by means of a separate
"check number". The check number is obtained by performing the same arithmetic
operations as on the original number, except that the
arithmetic is modulo a suitable base. The check arithmetic can thus be less
complex than the main arithmetic e.g. a check modulo 2 would only require 1 bit
arithmetic. Error detection is afforded by a comparison between the check number
and the main result modulo the check base: a modulo 2 check clearly detects any
error that turns an even number into an odd one, and vice versa. These modulo
sums are, for obvious reasons, referred to as "checkers" (see figure 5).
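The checker mechanism can be sketched in a few lines of Python. The class name and the check base of 3 are assumptions made for illustration, and only a single checker is shown.

```python
class Checked:
    """An integer carried together with its residue modulo BASE."""
    BASE = 3   # assumed check base, chosen only for illustration

    def __init__(self, value, check=None):
        self.value = value
        self.check = value % self.BASE if check is None else check

    def __add__(self, other):
        # Main arithmetic at full range, check arithmetic modulo BASE.
        return Checked(self.value + other.value,
                       (self.check + other.check) % self.BASE)

    def __mul__(self, other):
        return Checked(self.value * other.value,
                       (self.check * other.check) % self.BASE)

    def is_consistent(self):
        """Compare the main result, reduced modulo BASE, with the checker."""
        return self.value % self.BASE == self.check

result = Checked(5) * Checked(7) + Checked(4)
```

The result stays consistent through both operations. An error of 1 in `result.value` is detected, whereas an error that is a multiple of the base (3 here) is not, which is why the choice of base matters.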


Figure  5.  Residue  Coded  Arithmetic. 


Clearly it is possible for one, or more, of the checkers to be in error as
well as the main result. This makes the task of error protection slightly more
difficult for the residue codes as compared to that for AN codes. The theory for
this is however well established. It should be noted that as the full-range
answer is available the problem of sign determination and magnitude comparison
that besets RNS is not present.

3.5  Invariant  properties 

3.5.1  length,  trace  etc. 

In  certain  mathematical  operations  various  properties  of  the  input  variables 
are  invariant  e.g.  length  of  vectors  under  unitary  transformations,  matrix  trace 
under  addition  or  multiplication.  Such  properties  could  be  used  to  check  on  the 
final  result  i.e.  for  error  detection.  It  is  not  clear  as  yet  whether  they  can 
be  used  for  error  correction. 
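As a hedged illustration of such an invariant check, the Python sketch below uses a plane rotation as the unitary transformation and compares vector lengths before and after. The function names and the tolerance are assumptions.

```python
import math

def rotate(v, theta):
    """Apply a 2-D rotation (a real unitary transformation) to v."""
    x, y = v
    c, s = math.cos(theta), math.sin(theta)
    return (c * x - s * y, s * x + c * y)

def length_preserved(v_in, v_out, tol=1e-9):
    """Error check: a unitary transformation must not change the length."""
    return abs(math.hypot(*v_in) - math.hypot(*v_out)) < tol

v = (3.0, 4.0)
w = rotate(v, 0.7)
```

The check passes for `w` but fails for a corrupted output such as `(w[0] + 0.01, w[1])`. An error that happens to preserve the length would go unnoticed, and the check alone does not say which component is wrong.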

3.5.2  Operator  Invariant  Subspaces. 

There appears to exist a certain amount of work, at least in finite linear
algebra ([20] section 11.2), on subspaces that are invariant under a (linear)
operator. Further investigation is called for.

3.5.3 Spectral Methods

Blahut [17] has shown that there exists a "frequency" domain representation
of  (BCH)  error  correcting  codes.  Such  a  representation  appears  to  have  several 
advantages  from  a  coding  point  of  view  including  the  fact  that  the  syndrome  of  a 
received  word  is  just  its  DFT.  The  error  correcting  properties  arise  from  the 
fact  that  the  encoder  can  be  considered  to  be  a  filter  that  has  a  certain 
distribution  of  zeros.  When  errors  (noise?)  are  added  to  the  filtered  (encoded) 
sequence  the  result  no  longer  has  a  spectrum  with  these  zeros.  The  presence  of 
errors  can  thus  be  detected  by  the  non-zero  spectral  coefficients  (via  the  DFT). 
Error  correction  results  from  a  reconstruction  of  the  error  sequence  based  on  a 
knowledge  of  some  of  its  DFT  coefficients  (i.e.  those  at  the  zero  positions). 
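The spectral view can be made concrete with the smallest binary example, sketched below in Python. The setting is an assumption chosen for illustration and is not taken from [17]: length 7 binary words, GF(8) built from x^3 + x + 1, and a code whose words have a zero spectral coefficient at index 1. The syndrome is then that one transform coefficient, and for a single error at position i it equals alpha^i.

```python
# Powers of a primitive element alpha of GF(8), modulus x^3 + x + 1.
EXP = [1, 2, 4, 3, 6, 7, 5]
LOG = {value: power for power, value in enumerate(EXP)}

def syndrome(word):
    """Spectral coefficient V_1 = sum_i alpha^i * word[i] of a binary word."""
    s = 0
    for i, bit in enumerate(word):
        if bit:
            s ^= EXP[i]   # addition in GF(8) is XOR
    return s

def correct_single_error(word):
    """Locate and flip a single bit error from the non-zero coefficient."""
    s = syndrome(word)
    fixed = list(word)
    if s != 0:
        fixed[LOG[s]] ^= 1   # single error at position i gives V_1 = alpha^i
    return fixed

# Coefficients of 1 + x + x^3 (lowest power first); alpha is a root.
codeword = [1, 1, 0, 1, 0, 0, 0]
received = list(codeword)
received[5] ^= 1             # inject one error
```

Here `syndrome(codeword)` is zero, `syndrome(received)` is non-zero, and `correct_single_error(received)` recovers the codeword.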

There may be some possibility of exploiting the features of a convolution
operator to obtain a degree of error protection without adding (further?)
redundancy. Under suitable interpretations of "frequency" such an approach may
well work for any linear operator, including matrix operators.

3.6 Codes over the Real Field.

Most of the techniques reviewed so far are based on finite algebraic
structures  and  so  are  only  suitable  for  integer  problems.  As  most  problems  of 
interest  are  naturally  represented  in  the  Real/Complex  field,  it  is  desirable  to 
consider  error  protection  over  the  real  field. 

The best known of the (data transmission) error correction codes, the
block and the convolution codes, are based on the idea of convolving the data
sequence with a fixed sequence. The result of this convolution has certain
properties (e.g. zeros at certain frequency values) that enable any additive
noise to be detected and then removed. The extension to real sequences should be
considered. The similarity of the Berlekamp decoding algorithm and the
Yule-Walker equations in linear prediction/Kalman filters constitutes an
appealing, if not conclusive, argument in favour of the existence of real valued
codes.

3.7  Codes  for  Matrices 

3.7.1  2D  Codes 

There exists a certain amount of work (e.g. [19]) on the subject of 2-D error
correction codes. These coding schemes are based on the idea of appending check
rows and columns to a data matrix. Each check digit is the result of a parity
check over a row/column. Clearly such codes are the obvious extensions to
Abraham's work. It is not known at present (see note 7) what properties these
codes have i.e. whether or not they are linear and hence invariant under addition
and scalar multiplication.

3.7.2 Ideals in $M_{n \times n}(K)$.

As most of the successful error correcting codes to date are based on the use
of an ideal as the code space, it should be fruitful to investigate the
structure of ideals in the ring of $n \times n$ matrices over a field K ($M_{n \times n}(K)$). The
pure mathematics of this problem has, of course, already been solved [20].

7. by the author

4 Conclusion


The field of algorithmic fault tolerance appears to be wide open, with only a
handful of people actively engaged in relevant work. There is probably still a
lot of work in the area of (hardware) arithmetic codes/systems that could be
pertinent to the algorithmic case. Further study of the research literature, as
opposed to text books, is required.

Certainly there are still a lot of areas/techniques (see section 3) that, at
first sight, look promising and warrant closer investigation. There are several
problems with the use of AN codes for protecting complicated algebra. The
advantages of such a technique are, however, such that more effort could be
justified in this direction. It must however be remembered that (useful) AN
codes are only single error correcting and that for higher error protection a
switch to the use of residue codes would be required. The latter are at least as
good as, and often better than, AN codes in all respects except for the
complexity of the arithmetic (vector instead of scalar). Exactly how much of a
penalty this is remains to be seen.

There is still a lot of work to be done, even for established methods like AN
and residue codes and Redinbo's BCH-protected convolution, in establishing
exactly how many and what sort of errors can be tolerated. Care must be taken in
this question as to the definition of an error. The conventional approach, in AN
and residue codes, is to assume a serial ripple-adder type circuit and so define
an error as an additive difference of $\pm 2^i$. Increasingly in VLSI, arithmetic
circuits are pipelined or based on redundant number representations for which
the definition of $\pm 2^i$ as a single error may not be suitable. Also the question
of determining, in terms of percentage overheads, how much the error protection
costs must not be neglected. The underlying system that Redinbo proposed for a
fault tolerant FIR filter has been known for a long time and has not been
implemented in many real-time systems, presumably due to the complexity involved.

The anticipated possibilities of the spectral approach to BCH codes as
applied to the protection of FIR filters, and linear operators in general, seem
to be great. This is probably the next topic to which to devote the most
effort. The use of an extension field (section 3.3) and the resultant redundancy
in the arithmetic could prove fruitful depending on the cost of implementing
finite field arithmetic, which has in the past proved to be great. The two other
topics that appear to hold the most immediate potential are the 2-D codes
(section 3.7) and the BCH/Yule-Walker connection (section 3.6).

APPENDIX A


Fault  Tolerant  Ring  Algebra 
A.1 Perceived Problems and Some Conjectures

As  explained  in  section  3.2  in  order  to  have  a  viable  fault  tolerant  algebra 
using  AN  or  BCH  codes,  a  multiplication  operator  that  commutes  with  the  coding 
operation is required. Previous work [9][13] suggested a method whereby only one
of  the  operands  is  encoded.  This  is  because  of  the  presence  of  an  extra  factor 
of  g  -  the  code  generator  -  in  the  product  if  both  operands  are  encoded. 

Although  knowledge  of  the  presence  of  this  extra  factor  would  allow  a  division 
by  another  factor  of  g  and  thus  restore  the  correct  value,  the  correction  would 
have  to  be  performed  after  each  multiplication  and  not  at  the  end  of  the 
calculation  as  desired.  Further  in  a  complex  algorithm  it  is  usually  necessary 
to  perform  more  than  one  multiplication,  thus,  even  with  only  one  input  per 
multiplication encoded, factors of g build up with each multiplication (see
figure  Al).  In  calculations  that  contain  products  of  sums  the  factors  of  g  would 
be  hopelessly  caught-up  in  the  rest  of  the  expression  with  no  hope  of  removing 
them. 


[Figure A1 diagram: encoded operands are shown in bold; the output is a sum of products whose terms carry accumulated factors of g.]

Figure A1. Arithmetic with AN Coded Operands

Even  if  a  correction  were  applied  after  each  multiplication  there  still  are 
short-comings  with  this  approach.  Consider  the  basic  mechanism  of  error 
protection  for  the  product  a*x.  If  the  coefficient  a  is  encoded  as  a'  «  g-a  then 
the result of a faulty calculation could be

$$(x + e_1)(ga + e_2) + e_3 = g(ax) + (x e_2 + e_1 g a + e_1 e_2 + e_3)$$

where $e_1$ is an error associated with the inputting and any preprocessing of
the variable x,
$e_2$ is due to the erroneous storage of the coefficient a,
and $e_3$ accounts for any errors in the main section of the multiplier.

As before the required result, g(ax), is a multiple of g but then so is the error
factor ($e_1 g a$). Thus, with a proper choice of generator g, it may be possible to
protect the circuit against errors of the form $e_2$ and $e_3$ but it is never
possible to provide protection against corruption of the variable x. It is also
worth noting at this point that whereas the standard theory will work for the
error $e_3$, the error $e_2$ appears as a multiple of x so that extra care must be
taken that possible error values ($e_2$) and input values (x) are not such that the
product $e_2 x$ is a multiple of g. If this were the case then the error would
effectively be masked by the particular input value. The same can be said with
reference to the term ($e_1 e_2$). A possible solution may be to choose g such that
it is prime (e.g. the Brown-Peterson AN codes - see [10] p101).
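The behaviour of the three error terms can be checked numerically with a short Python sketch. The generator g = 19 is borrowed from the example in section 3.2, and the function names are illustrative.

```python
G = 19   # prime generator, borrowed from the example of section 3.2

def faulty_product(a, x, e1=0, e2=0, e3=0):
    """(x + e1)(G*a + e2) + e3, with only the coefficient a encoded."""
    return (x + e1) * (G * a + e2) + e3

def error_detected(y):
    """A result that is not a multiple of G signals an error."""
    return y % G != 0

clean = faulty_product(5, 3)
input_error = faulty_product(5, 3, e1=4)     # corruption of x
storage_error = faulty_product(5, 3, e2=2)   # corruption of stored G*a
masked_error = faulty_product(5, 38, e2=2)   # x is a multiple of G
```

Only `storage_error` is flagged. The corrupted input passes the check because $e_1 g a$ is a multiple of g, and the storage error is masked when x = 38 because $e_2 x$ = 76 is itself a multiple of 19.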

A solution to the problem of accruing factors of g is the use of idempotent
generators. Under certain conditions it can be shown that there exists a
generator for a cyclic code that is idempotent i.e. $\gamma^2 = \gamma$ (see section A.2).
Using such a generator means that not only is the problem of accumulating
factors of (in this case) $\gamma$ solved but also both operands could now be encoded
in the multiplication process (figure A2) or, more significantly, that the input
may be left in coded form (from the previous operation).


Figure A2. Idempotent Cyclic Code for Multiplication

This does not, however, mean that the input is protected by the code. On the contrary, having the input as a multiple of $\gamma$ can be considered to make matters worse. Consider

$$(\gamma x + e_1)(\gamma a + e_2) + e_3 = \gamma(ax) + (\gamma x e_2 + e_1 \gamma a + e_1 e_2 + e_3)$$

Here the output error terms $\gamma x e_2$ and $e_1 \gamma a$ are both multiples of $\gamma$ and hence are undetectable. Thus any error associated with the storage of the coefficient $a$ ($e_2$) can only be detected in conjunction with an input error $e_1$, via the $e_1 e_2$ term.
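A small numerical check of this claim can be made in $Z_{65}$ with the idempotent $\gamma = 26$ used as the example in section A.2. This is an illustrative sketch only: the operand values x = 2, a = 3 and the function names are assumptions, and a result is accepted as a codeword if it lies in the ideal generated by 26, i.e. if it is a multiple of 13 modulo 65.

```python
M = 65        # modulus of the ring Z_65
GAMMA = 26    # idempotent generator: 26 * 26 % 65 == 26
G = 13        # <26> = <13>, so codewords are the multiples of 13 modulo 65


def faulty_product(x, a, e1=0, e2=0, e3=0):
    """Return (GAMMA*x + e1)(GAMMA*a + e2) + e3 reduced modulo M."""
    return ((GAMMA * x + e1) * (GAMMA * a + e2) + e3) % M


def is_codeword(value):
    """Membership test for the ideal generated by GAMMA."""
    return value % G == 0


x, a = 2, 3  # assumed operands, chosen only for illustration
correct = (GAMMA * x * a) % M

scenarios = [
    ("e1 only", dict(e1=1)),
    ("e2 only", dict(e2=1)),
    ("e1 and e2", dict(e1=1, e2=1)),
]
for label, errors in scenarios:
    r = faulty_product(x, a, **errors)
    print(label, "| wrong result:", r != correct, "| detected:", not is_codeword(r))
```

With these values a single $e_1$ or $e_2$ error gives a wrong result that is still a codeword, and only the combined error is flagged, through the $e_1 e_2$ term.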

One solution to this problem could be to use a mixture of codes. The input variable ($x$) could be encoded using one code, and the coefficient ($a$) using another:

$$(g_1 x + e_1)(g_2 a + e_2) + e_3 = g_1 g_2 (ax) + (g_1 x e_2 + e_1 g_2 a + e_1 e_2 + e_3)$$

Such coding may then afford protection against errors of type $e_1$, $e_2$. What is then required is to be able to combat type $e_3$ errors in such a way that the output of the multiplier is encoded in the same code as $x$. This might be achieved by requiring that the generator for the coefficient code be an idempotent with respect to the generator of the input code, i.e. such that $g_1 \cdot g_2 = g_1$, but this requires further research.
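As a rough illustration of the mixed-code expression, the sketch below uses two hypothetical generators, 7 for the input and 13 for the coefficient, and accepts a result only if it is a multiple of their product. These values, the use of ordinary integer (non-modular) arithmetic and the acceptance test are all assumptions made for the example, not choices taken from the text.

```python
G1, G2 = 7, 13  # hypothetical generators for the input and coefficient codes


def faulty_product(x, a, e1=0, e2=0, e3=0):
    """Return (G1*x + e1)(G2*a + e2) + e3 using ordinary integer arithmetic."""
    return (G1 * x + e1) * (G2 * a + e2) + e3


def is_codeword(value):
    """A fault-free product G1*G2*(a*x) is a multiple of G1*G2."""
    return value % (G1 * G2) == 0


x, a = 2, 4  # assumed operands
print("fault-free accepted:", is_codeword(faulty_product(x, a)))
print("e1 = 1 accepted:   ", is_codeword(faulty_product(x, a, e1=1)))
print("e2 = 1 accepted:   ", is_codeword(faulty_product(x, a, e2=1)))
print("e1 = 7 accepted:   ", is_codeword(faulty_product(x, a, e1=7)))
```

In this example single unit errors on either operand are rejected, but an input error that is itself a multiple of the input generator (here 7) is still accepted.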

McWhirter (footnote 8) has pointed out that the problem of masking errors, particularly input errors (type $e_1$), in itself may not be that disastrous since the inputs to each multiplier could be checked. This is clearly not as satisfactory as a single check at the end of a (complex) calculation but is better than having to check the inputs to each adder as well as each multiplier.

Given the elegance of residue codes it may be argued that AN codes have no use in fault tolerant systems. The author believes, however, that this is not necessarily so. There are many low-distance AN codes that have quite a large number of codewords and hence have some practical value. The lack of error correcting power may not be all that prohibitive if the expected frequency of errors is low and detection, rather than correction, is acceptable. One application that springs to mind is in residue code systems. Here (see section 3.4) it is possible for the checkers to be in error; however, in this case it is not necessary to correct the result: knowing it is in error is sufficient. The checker circuits could then benefit from being coded for error detection.

8. private communication

AN codes also have the advantage that the arithmetic is "standard", unlike residue codes which have a parallel structure; thus AN codes could be used in on-line testing of conventional circuits. The basic fact that allows any coding scheme to work is that the codewords are recognisable as such. In the case of cyclic codes, they are all multiples of the generator. As AN codes are designed such that the codewords remain codewords under addition and, to a certain extent, multiplication, if the inputs to an algorithm are AN codewords then the output should be one also. Thus the inputs to a system could be periodically put in AN coded form and the output checked to see if it is also a codeword. It may not be possible to detect all errors (see above) but certainly some errors will be found.
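The periodic check described here might be sketched as follows. Everything in the example is assumed for illustration: the generator 13, the weighted sum standing in for a "conventional circuit", and the injected fault that adds 1 to the output.

```python
G = 13  # hypothetical AN generator used only for this sketch


def encode(n):
    """Put an integer into AN coded form."""
    return G * n


def is_codeword(value):
    return value % G == 0


def weighted_sum(xs, weights=(2, -1, 3)):
    """Stand-in for a conventional circuit: the weights are plain integers,
    so AN-coded inputs should give an AN-coded output."""
    return sum(w * x for w, x in zip(weights, xs))


def faulty_weighted_sum(xs):
    """The same circuit with an assumed fault that adds 1 to the output."""
    return weighted_sum(xs) + 1


def passes_online_check(circuit, test_values):
    """Feed AN-coded test inputs to the circuit and test the output."""
    return is_codeword(circuit([encode(n) for n in test_values]))


print(passes_online_check(weighted_sum, [1, 2, 3]))
print(passes_online_check(faulty_weighted_sum, [1, 2, 3]))
```

The circuit itself is unchanged; only the test inputs are coded, which is what makes the approach attractive for conventional hardware.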
Definition:

Let $R$ be a ring, then $e \in R$ is an idempotent if

$$e^2 = e$$


Theorem:

Let $K$ be a Euclidean domain. An ideal $\langle g \rangle$ (footnote 9) in the quotient ring $R = K/\langle j \rangle$ (footnote 10), $j \in K$ and $g \mid j$, can be generated by a unique idempotent $\gamma$, provided $(g, j/g) = 1$.


E.g. let $K$ be the ring of integers. If $j = 65$, the ideal $\langle j \rangle$ is the set of integer multiples of 65 and the quotient ring $K/\langle j \rangle$ is the ring of integers modulo 65 (i.e. $Z_{65}$). Now consider the ideal of $Z_{65}$ generated by $g = 13$ (i.e. all integers that are multiples of 13 modulo 65):

$$\langle g \rangle = \{13, 26, 39, 52, 65 \equiv 0\}$$

On the other hand

$$\langle 26 \rangle = \{26, 52, 78 \equiv 13, 104 \equiv 39, 130 \equiv 0\} = \langle g \rangle,$$

and

$$(26)^2 = 676 \equiv 26 \pmod{65}.$$


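Proof: (cf. [15] ch.8)

Define $h = j/g$. If $(h, g) = 1$ then, by the Euclidean division algorithm, there exist $p, q \in R$ such that

$$pg + qh = 1$$

Consider $\gamma = pg \in R$. Multiplying the identity above by $\gamma$ gives

$$\gamma(\gamma + qh) = \gamma$$

i.e.

$$\gamma^2 + pgqh = \gamma$$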
9. $\langle g \rangle$ is the principal ideal generated by the element $g$.

10. $K/\langle j \rangle$ is the ring of residue classes of $K$ modulo $\langle j \rangle$.
Multiplication is commutative in a Euclidean domain, hence

$$pgqh = (pq)(gh) = (pq)j = 0$$

Thus

$$\gamma^2 = \gamma$$

and $\gamma$ is an idempotent.

In the case where $R = Z_{65}$ and $g = 13$, we have $h = 65/13 = 5$. As 13 and 5 are relatively prime we can find integers $p, q$ such that

$$13p + 5q \equiv 1 \pmod{65}.$$

In fact $p = 2$, $q = 8$ and hence $\gamma = 2 \cdot 13 = 26$.


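Now as

$$\gamma = pg$$

then

$$\langle \gamma \rangle \subseteq \langle g \rangle,$$

but

$$g\gamma = g(1 - qh) = g - qj = g,$$

i.e.

$$\langle g \rangle \subseteq \langle \gamma \rangle.$$

Thus

$$\langle g \rangle = \langle \gamma \rangle$$

and $\gamma$ generates the ideal.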


Clearly the ideal generated by the element 13 must contain the element 26, so that $\langle 26 \rangle$ will be a subset of $\langle 13 \rangle$.

But as $13 \cdot 26 = 338 \equiv 13 \pmod{65}$, the element 13 can be considered to be a multiple of 26 and hence $\langle 13 \rangle \subseteq \langle 26 \rangle$.




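The idempotent $\gamma$ is unique, for if $f \in R$ is another idempotent that generates $\langle g \rangle$ then

$$\gamma \in \langle f \rangle$$

i.e.

$$\gamma = af \quad \text{for some } a \in R.$$

Thus

$$\gamma f = af^2 = af = \gamma.$$

Similarly

$$f \in \langle \gamma \rangle$$

i.e.

$$f = b\gamma \quad \text{for some } b \in R,$$

thus

$$\gamma f = b\gamma^2 = b\gamma = f.$$

Hence

$$\gamma = f.$$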


If the element $f$ generated the ideal then the element 26 must be a multiple of $f$: $26 \equiv af \pmod{65}$.

Hence

$$26f \equiv (af)f \equiv af^2 \pmod{65}.$$

If $f$ is also an idempotent (i.e. $f^2 = f$) then

$$26f \equiv af \equiv 26 \pmod{65}.$$

We already know that the element 26 generates the ideal, so that $f$ must be some multiple of 26: $26b$ say. Then

$$26f \equiv 26(26b) \equiv 26b \equiv f \pmod{65}.$$

Hence $f = 26$ and the element 26 is the only idempotent generator.
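The construction in the proof can be followed step by step for the integer case. The sketch below is an illustration under stated assumptions: the helper names are invented for this example, the extended Euclidean algorithm is used to find p, and the uniqueness claim is checked by brute force over $Z_{65}$ only.

```python
def extended_gcd(a, b):
    """Return (d, p, q) with d = gcd(a, b) and p*a + q*b == d."""
    if b == 0:
        return a, 1, 0
    d, p, q = extended_gcd(b, a % b)
    return d, q, p - (a // b) * q


def idempotent_generator(g, j):
    """Return gamma = p*g mod j, where p*g + q*(j/g) = 1, as in the proof."""
    if j % g != 0:
        raise ValueError("g must divide j")
    h = j // g
    d, p, _ = extended_gcd(g, h)
    if d != 1:
        raise ValueError("g and j/g must be relatively prime")
    return (p * g) % j


def ideal(generator, j):
    """All multiples of the generator in the integers modulo j."""
    return {(generator * k) % j for k in range(j)}


gamma = idempotent_generator(13, 65)
print(gamma, (gamma * gamma) % 65 == gamma, ideal(gamma, 65) == ideal(13, 65))

# Brute-force search for every idempotent of Z_65 that generates <13>.
generators = [f for f in range(65)
              if (f * f) % 65 == f and ideal(f, 65) == ideal(13, 65)]
print(generators)
```

The Euclidean step may return a different q from the one quoted in the text (any q congruent to 8 modulo 13 satisfies the congruence), but the resulting idempotent is the same.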
A.2.1 Idempotent Generators for BCH Codes

A BCH code is an ideal in $GF(q)[x]/\langle x^n - 1 \rangle$, for $(q, n) = 1$, generated by $g(x)$, a factor of $x^n - 1$. It is well known that the roots of $g(x)$ consist of conjugate sets ([15] p199) and hence that $(g(x), (x^n - 1)/g(x)) = 1$. Thus the conditions of the above theorem hold and the code can be generated by the idempotent

$$\gamma(x) = p(x) g(x)$$

where $p(x) \equiv (g(x))^{-1} \pmod{(x^n - 1)/g(x)}$.


A.2.2 Idempotent Generators for AN Codes

A cyclic AN code is an ideal in $Z_{2^n - 1}$, generated by $A$, a divisor of $2^n - 1$.

Thus provided that $(A, (2^n - 1)/A) = 1$, the code can be generated by the idempotent

$$\alpha = pA$$

where $p \equiv A^{-1} \pmod{(2^n - 1)/A}$.
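The formula for $\alpha$ translates directly into a short routine. This is a sketch under assumptions: the function name is invented, the example values A = 9 and n = 6 (modulus 63) are chosen only for illustration, and the modular inverse relies on Python's built-in three-argument `pow` with a negative exponent, available from Python 3.8.

```python
from math import gcd


def an_idempotent(A, n):
    """Return the idempotent generator alpha = p*A of the AN code <A> in Z_(2^n - 1).

    Raises ValueError if A does not divide 2^n - 1, or if A and (2^n - 1)/A
    are not relatively prime, since the theorem then does not apply.
    """
    m = 2 ** n - 1
    if m % A != 0:
        raise ValueError("A must divide 2^n - 1")
    cofactor = m // A
    if gcd(A, cofactor) != 1:
        raise ValueError("A and (2^n - 1)/A are not relatively prime")
    p = pow(A, -1, cofactor)  # p = A^-1 MOD (2^n - 1)/A
    return (p * A) % m


alpha = an_idempotent(9, 6)  # assumed example: A = 9, n = 6, modulus 63
print(alpha, (alpha * alpha) % 63 == alpha)
```

The explicit coprimality test mirrors the proviso in the theorem, so the routine refuses to return a value when the condition is not met.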

It is not known (footnote 11) if the condition that $A$ and $(2^n - 1)/A$ be relatively prime can always be met. The condition certainly holds for some values of $A$. Consider the single error-correcting AN code in the ring $Z_{65}$, with $A = 13$ and $0 \le N < 5$. Here we have

$$\alpha = 26.$$

To see the code at work in a multiplication consider the calculation

$$(a + b) \cdot c \bmod 5$$

where $a = 2$, $b = 4$, $c = 3$. Encoding the values $a$, $b$, $c$ as AN codewords we get

$$a' = 52,$$
$$b' = 104 \equiv 39 \pmod{65},$$
$$c' = 78 \equiv 13 \pmod{65}.$$

With perfect arithmetic we get

$$(52 + 39) \cdot 13 \equiv 26 \cdot 13 \equiv 3 \cdot 26 \pmod{65}$$

which is the coded form of 3, the correct answer. Repeating the calculation with the original generator $A = 13$, we find:

$$a' = 26, \quad b' = 52, \quad c' = 39$$

and hence

$$(26 + 52) \cdot 39 = 78 \cdot 39 \equiv 52 \pmod{65}.$$


11.  by  the  author 




But $52 = 13 \cdot 4$, i.e. the coded form of the number 4, which is not the correct answer.
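The two calculations can be reproduced with a few lines of code. The sketch below assumes a table look-up for decoding and an invented function name; the numerical values are those of the worked example.

```python
M = 65  # ring Z_65; the uncoded values are integers modulo 5


def evaluate(generator, a, b, c):
    """Encode a, b, c with the given generator, compute (a' + b') * c' in Z_65,
    and decode by table look-up. Returns (coded result, decoded value); the
    decoded value is None if the result is not a codeword."""
    encode = {n: (generator * n) % M for n in range(5)}
    decode = {codeword: n for n, codeword in encode.items()}
    result = ((encode[a] + encode[b]) * encode[c]) % M
    return result, decode.get(result)


expected = ((2 + 4) * 3) % 5
print("expected:", expected)
print("idempotent generator 26:", evaluate(26, 2, 4, 3))
print("original generator 13:  ", evaluate(13, 2, 4, 3))
```

With the idempotent generator the product of two codewords decodes to the expected value, whereas with the generator 13 the extra factor of 13 picked up in the multiplication yields a valid codeword for the wrong number.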